Published on

AWS RDS SQLServer 还原故障 Troubleshooting

Authors
  • avatar
    Name
    Wang Zhiwei
    Twitter

背景

其它数据中心的 SQLServer 定期备份并将备份文件上传 S3,再还原到 AWS RDS(SQLServer),采用每周末一次全量备份+工作日每天一次差异备份的执行计划。

我使用 pymssql 库访问 AWS RDS,并执行 AWS 提供的 stored procedure 来进行数据库还原,T-SQL 如下,详见 AWS 文档。

exec msdb.dbo.rds_restore_database
	@restore_db_name='database_name',
	@s3_arn_to_restore_from='arn:aws:s3:::bucket_name/file_name.extension',
	@with_norecovery=0|1,
	[@kms_master_key_arn='arn:aws:kms:region:account-id:key/key-id'],
	[@type='DIFFERENTIAL|FULL'];

https://docs.aws.amazon.com/AmazonRDS/latest/UserGuide/SQLServer.Procedural.Importing.html#SQLServer.Procedural.Importing.Native.Using.Restore


创建一个还原 Task 后,轮询 Task 执行状态。

T-SQL:

exec msdb.dbo.rds_task_status
	[@db_name='database_name'],
	[@task_id=ID_number];

代码逻辑如下,这里借助 redo 在数据库连接丢失后进行重试,重新建立数据库连接。

@redo.retriable(attempts=5, sleeptime=60, retry_exceptions=(DisconnectError,))
def wait(self, creds: SqlCreds, delay=60):
    """
    SQL Server also has an "Idle Connection Timeout" setting, which determines how long a connection
    can be idle (not used) before it is closed. This setting is controlled by the "Remote Query Timeout"
    configuration option and is set to 600 seconds (10 minutes) by default.
    """
    conn = creds.connect()
    try:
        while True:
            status = self.status(conn=conn)
            lifecycle = status["lifecycle"]
            database = self.meta["db_name"]
            if lifecycle in (LifeCycle.IN_PROGRESS, LifeCycle.CREATED):
                LOG.info(f"Waiting for {database} to be restored...")
                time.sleep(delay)
            elif lifecycle == LifeCycle.SUCCESS:
                return
            elif lifecycle == LifeCycle.ERROR:
                raise pymssql.OperationalError(
                    f"Error restoring {database}\t{json.dumps(status, default=str)}"
                )
            else:
                raise ValueError(
                    f"Unknown lifecycle: {lifecycle}\t{json.dumps(status, default=str)}"
                )
    except pymssql.OperationalError as e:
        LOG.error(f"Error waiting: {e}")
        if creds.is_disconnect(e, conn):
            raise DisconnectError(f"PyMSSQL connection has gone away: {e}")
    finally:
        conn.close()

异常 1

这是 RDS rds_task_status 的执行异常日志:

[2023-02-17 03:28:23.083] Aborted the task because of a task failure or a concurrent RESTORE_DB_DIFFERENTIAL request.
[2023-02-17 03:28:23.643] Task has been aborted
[2023-02-17 03:28:23.910] Task was initiated on server name: EC2AMAZ-7POJKEI, current server name: EC2AMAZ-T85JIN4. Cannot complete the task, please try again.

异常 2

这是 Python 程序的执行异常日志:

2023-02-17 02:02:33,089: ERROR : Error waiting: (20047, b'DB-Lib error message 20047, severity 9:\nDBPROCESS is dead or not enabled\n')
2023-02-17 02:04:02,066: INFO : retry: calling wait with args: (FullBackup(id='MARTPH_20230217|2023-02-11|FULL|1|2023-02-17'), <bayer_cdp_common_utils.mssql_handler.SqlCreds object at 0x7ff9144bc4e0>), kwargs: 
{}
, attempt #3
2023-02-17 02:07:17,487: INFO : retry: calling wait with args: (FullBackup(id='MARTPH_20230217|2023-02-11|FULL|1|2023-02-17'), <bayer_cdp_common_utils.mssql_handler.SqlCreds object at 0x7ff9144bc4e0>), kwargs: 
{}
, attempt #4
2023-02-17 02:11:41,495: INFO : retry: calling wait with args: (FullBackup(id='MARTPH_20230217|2023-02-11|FULL|1|2023-02-17'), <bayer_cdp_common_utils.mssql_handler.SqlCreds object at 0x7ff9144bc4e0>), kwargs: 
{}
, attempt #5
2023-02-17 02:12:41,536: INFO : retry: Giving up on wait
Traceback (most recent call last):
  File "/tmp/runscript.py", line 230, in <module>
    raise e_type(e_value).with_traceback(new_stack)
  File "/tmp/glue-python-scripts-l7dpyz1k/ph-cdw-sqlserver-restore.py", line 90, in <module>
  File "/tmp/glue-python-scripts-l7dpyz1k/ph-cdw-sqlserver-restore.py", line 56, in _wait_for_restore
  File "/glue/lib/installation/bayer_cdp_common_utils/rds_restore_handler.py", line 837, in wait_for_restore
    bak.full_backup.wait(creds)
  File "/glue/lib/installation/redo/__init__.py", line 215, in _retriable_wrapper
    return retry(func, args=args, kwargs=kwargs, *retry_args, **retry_kwargs)
  File "/glue/lib/installation/redo/__init__.py", line 170, in retry
    return action(*args, **kwargs)
  File "/glue/lib/installation/bayer_cdp_common_utils/rds_restore_handler.py", line 408, in wait
    conn = creds.connect()
  File "/glue/lib/installation/bayer_cdp_common_utils/mssql_handler.py", line 36, in connect
    autocommit=autocommit,
  File "src/pymssql/_pymssql.pyx", line 653, in pymssql._pymssql.connect
pymssql._pymssql.OperationalError: (20009, b'DB-Lib error message 20009, severity 9:\nUnable to connect: Adaptive Server is unavailable or does not exist (fusion-q-1.cpmknfgp0x9h.rds.cn-north-1.amazonaws.com.cn)\nNet-Lib error during Connection timed out (110)\nDB-Lib error message 20009, severity 9:\nUnable to connect: Adaptive Server is unavailable or does not exist (fusion-q-1.cpmknfgp0x9h.rds.cn-north-1.amazonaws.com.cn)\nNet-Lib error during Connection timed out (110)\n')

Troubleshoots

从以下部分日志可以发现因为数据库服务的问题连接会话中断了,并且三次重新连接都失败了。

2023-02-17 02:02:33,089: ERROR : Error waiting: (20047, b'DB-Lib error message 20047, severity 9:\nDBPROCESS is dead or not enabled\n')
...
, attempt #5
2023-02-17 02:12:41,536: INFO : retry: Giving up on wait

以上问题出现了两次,第一次出现时我以为是 pymssqlconnection 超时被 server 主动中断了,于是我加上 redo 机制,希望在连接丢失后重新创建并使用,几天之后发生了同样的报错。

我非常纳闷,并在看到 GitHub 上的相似 issue 后怀疑是重试创建连接的方式有问题。

https://github.com/rails-sqlserver/activerecord-sqlserver-adapter/issues/402#issuecomment-1356793443


我又认真看了日志,试图找出一些蛛丝马迹。按照代码执行流程,下面是第一步的报错信息:

(20047, b'DB-Lib error message 20047, severity 9:\nDBPROCESS is dead or not enabled\n')

这是下一步 reconnect 的报错:

(20009, b'DB-Lib error message 20009, severity 9:\nUnable to connect: Adaptive Server is unavailable or does not exist (xxx.amazonaws.com.cn)\nNet-Lib error during Connection timed out (110)\nDB-Lib error message 20009, severity 9:\nUnable to connect: Adaptive Server is unavailable or does not exist (xxx.amazonaws.com.cn)\nNet-Lib error during Connection timed out (110)\n')

显然之前没有仔细看,虽然都是数据库连接的异常,但两次异常不能视之相同,第一次异常是数据库连接访问不到数据库了,第二次异常是连接超时(Connection timed out)。


先暂停去看异常 1,在官方文档中找到解释:

Task was initiated on server name: EC2AMAZ-ABCDEFG, current server name: EC2AMAZ-MNOPQRS. Cannot complete the task, please try again. Usually you’ll see this error when a task is in progress on a mirrored instance and a failover happens. The new principal or primary does not know about the task in progress. This error can also appear when the task is in progress and the underlying EC2 machine has been replaced with a different one. In either case, reissue the task to solve the error.

Migrating Microsoft SQL Server Enterprise Workloads to Amazon RDS: Part 1 | AWS Database Blog

很让人意外,RDS 实例底层的 EC2 节点难道换了?我决定去 AWS RDS 控制台看一下。

💡 这里补充一点,此处失败的还原的任务是差异备份的还原(DIFFERENTIAL),分为两步,第一步还原全量备份的 bak 文件,第二步还原差异备份的 bak 文件,两步合在一起为一次完整的 SQLServer 差异还原。


首先注意到 RDS 实例开启了自动备份的功能。

Untitled.png

Untitled.png


可以从 events 中看到最近备份的记录。

Untitled.png


我翻到 RDS 服务日志,看看发生了什么,有没有异常记录。

Untitled.png

2023-02-17 02:34:16.42 Server      Microsoft SQL Server 2016 (SP3) (KB5003279) - 13.0.6300.2 (X64) 
	Aug  7 2021 01:20:37 
	Copyright (c) Microsoft Corporation
	Standard Edition (64-bit) on Windows Server 2016 Datacenter 10.0 <X64> (Build 14393: ) (Hypervisor)

2023-02-17 02:34:16.43 Server      UTC adjustment: 0:00
2023-02-17 02:34:16.43 Server      (c) Microsoft Corporation.
2023-02-17 02:34:16.43 Server      All rights reserved.
2023-02-17 02:34:16.43 Server      Server process ID is 4760.
2023-02-17 02:34:16.43 Server      System Manufacturer: 'Amazon EC2', System Model: 'm5.xlarge'.
2023-02-17 02:34:16.43 Server      Authentication mode is MIXED.
2023-02-17 02:34:16.43 Server      Logging SQL Server messages in file 'D:\rdsdbdata\Log\ERROR'.
2023-02-17 02:34:16.43 Server      The service account is 'awscn\EC2AMAZ-T85JIN4$'. This is an informational message; no user action is required.
2023-02-17 02:34:16.43 Server      Registry startup parameters: 
	 -d D:\rdsdbdata\DATA\master.mdf
	 -e D:\rdsdbdata\Log\ERROR
	 -l D:\rdsdbdata\DATA\mastlog.ldf
	 -k 20.000000
	 -T 3226
2023-02-17 02:34:16.43 Server      Command Line Startup Parameters:
	 -s "MSSQLSERVER"
...

2023-02-13 22:08:52.60 Server      Microsoft SQL Server 2016 (SP3) (KB5003279) - 13.0.6300.2 (X64) 
	Aug  7 2021 01:20:37 
	Copyright (c) Microsoft Corporation
	Standard Edition (64-bit) on Windows Server 2016 Datacenter 10.0 <X64> (Build 14393: ) (Hypervisor)

2023-02-13 22:08:52.60 Server      UTC adjustment: 0:00
2023-02-13 22:08:52.60 Server      (c) Microsoft Corporation.
2023-02-13 22:08:52.60 Server      All rights reserved.
2023-02-13 22:08:52.60 Server      Server process ID is 144.
2023-02-13 22:08:52.60 Server      System Manufacturer: 'Amazon EC2', System Model: 'm5.xlarge'.
2023-02-13 22:08:52.60 Server      Authentication mode is MIXED.
2023-02-13 22:08:52.60 Server      Logging SQL Server messages in file 'D:\rdsdbdata\Log\ERROR'.
2023-02-13 22:08:52.60 Server      The service account is 'awscn\EC2AMAZ-7POJKEI$'. This is an informational message; no user action is required.
2023-02-13 22:08:52.60 Server      Registry startup parameters: 
	 -d D:\rdsdbdata\DATA\master.mdf
	 -e D:\rdsdbdata\Log\ERROR
	 -l D:\rdsdbdata\DATA\mastlog.ldf
	 -k 20.000000
	 -T 3226
2023-02-13 22:08:52.60 Server      Command Line Startup Parameters:
	 -s "MSSQLSERVER"
...

很明显,RDS 实例被重启了,并且 EC2 节点也更换了,重启的时间间隔为 7 天,对应自动备份设置的时间。到这里,两个异常的原因都找到了。

RDS 实例重启导致 pymssql 的连接丢失并且重试失败,RDS 重启切换 EC2 导致差异还原两个步骤执行在不同的 EC2 上。